feat(otel): instrument runtime with GenAI semantic conventions #2620
tdabasinskas wants to merge 14 commits into
Conversation
tdabasinskas force-pushed from fa4a01d to 2a69313
@tdabasinskas I'm not sure why, but GitHub doesn't want to merge this one because of hypothetical merge conflicts. Could you rebase?
tdabasinskas force-pushed from 2a69313 to 9b08feb
Done!
tdabasinskas force-pushed from e7194da to b6a181b
/review
I don't think that worked 😅
/review
aheritier
left a comment
LGTM. Clean design, solid thread safety, good spec adherence. The inline comments are all non-blocking suggestions for follow-up.
❌ PR Review Failed — The review agent encountered an error and could not complete the review. View logs.
@tdabasinskas can you rebase one more time and I'll review it? |
Done! |
aheritier
left a comment
Re-approving — my prior approval was dismissed by the merge of upstream/main into the branch, but there are zero new author code changes since a4ce95e8. All three of my previous comments were addressed and the threads are resolved. CI is green on the merge commit.
Original assessment stands: clean design, solid thread safety, good GenAI semconv adherence. LGTM.
I assume you wanted to tag me here :) I gave write access to the repo - feel free to push changes.
Yes, I know the PR is quite big. From the codebase perspective, there's a lot of impact. I did try splitting this into a few bigger commits that should be reviewable independently. I guess I could split the whole PR into smaller PRs if that helps reviewability. Or if you'd prefer to push your own restructuring on top, that works too. From the user perspective, all of this is gated under
Most of the things added here are covered by new unit tests. Happy to add an e2e test that runs an agent end-to-end and asserts on the resulting span tree, if you think that would help. Regarding the PII, correct me if I'm wrong, but I understood that the Docker telemetry (enabled by default) is in no way related to the telemetry exposed by OTel. OTel telemetry requires not only
Hi @dgageot, @aheritier, do you have any updates on how to move forward with this? I see there are already conflicts again — I can resolve them, but since the branch drifts quite quickly, it would be good to know whether we're planning to merge this.
/review |
Needs to be rebased (at a minimum). Moving it to draft.
❌ PR Review Failed — The review agent encountered an error and could not complete the review. View logs.
- `pkg/telemetry/genai/` provides the GenAI semantic-conventions surface: span helpers (`ChatSpan`, `EmbeddingSpan`, `FallbackSpan`, `SandboxSpan`, runtime helpers), attribute / operation-name / provider-name constants per the OTel GenAI semconv, conversation-id baggage round-trippers, error classification, content-capture gating (`OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT`), stability gating (`OTEL_SEMCONV_STABILITY_OPT_IN`), `gen_ai.client.token.usage` and operation-duration histograms, the `gen_ai.evaluation.result` log emitter, and process-boundary helpers (`InjectSandboxEnv`, `InjectTraceContextEnv`)
- `pkg/telemetry/mcp/` provides MCP-specific telemetry: `ConversationIDFromBaggage`, span starters for client / server, `params._meta` propagation carrier, attribute constants, and metrics
- Test files cover content gating, stability defaults, conversation propagation, and span lifecycle invariants
- `cmd/root/otel.go`: stand up `TracerProvider` / `MeterProvider` / `LoggerProvider` from a single `initOTelSDK` entry, configure OTLP/HTTP exporters with explicit-scheme endpoint normalization, set the global W3C trace-context + baggage propagator unconditionally, flush providers in dependency order, attach `service.*` / `host.*` / `process.*` / `os.type` / `host.arch` resource attributes, and use `AlwaysSample` so local agent sessions are not dropped by an upstream sampling decision
- `pkg/httpclient/client.go`: add a `WrapWithOTel` round-tripper gated on a single `atomic.Bool` flipped by `initOTelSDK` (avoids the prior mismatch between `--otel` and the otelhttp wrap), plus `TracedDefaultClient` / `TracedClient` helpers for one-off HTTP calls
- `cmd/root/sandbox.go`: open a host-side `sandbox.exec` span and inject the active W3C trace context as `-e KEY=VALUE` flags so processes inside the container chain onto the host trace
- `cmd/root/new.go`, `cmd/root/otel_test.go`: wire tracer scope and cover the endpoint normalization / localhost detection cases
- `go.mod` / `go.sum`: pull in `go.opentelemetry.io/otel` SDK + OTLP/HTTP exporters
…s and metrics
- `pkg/model/provider/instrument.go`: decorator that wraps any `Provider` with a `chat {model}` CLIENT span (per OTel GenAI semconv), opt-in capture of `gen_ai.input.messages` / `gen_ai.output.messages` / `gen_ai.tool.definitions`, request/response attributes including the Anthropic spec-sum input-token computation (input + cache_read + cache_creation), `gen_ai.client.token.usage` histogram, and `gen_ai.client.operation.duration` histogram. Six wrapper variants preserve the EmbeddingProvider / RerankingProvider capability surfaces so RAG fallbacks round-trip correctly
- `pkg/model/provider/factory.go`, `factory_test.go`: route construction through the decorator
- `pkg/model/provider/anthropic/client.go`, `files.go`: add `anthropic.tokens.count` and `anthropic.files.get_or_upload` spans for the overflow-retry token-counting path and the file-upload cache-or-create path; drop the unnecessary `string(model)` cast
…n, skills, and background agents
- `pkg/runtime/loop.go`: open `runtime.session` and `runtime.stream` INTERNAL spans seeded with `gen_ai.conversation.id` baggage at session start; mark the session span with `error.type=loop_detected` + `codes.Error` when the loop detector terminates
- `pkg/runtime/fallback.go`, `pkg/runtime/cache.go`: wrap the fallback chain with a `runtime.fallback` span carrying primary/final model, attempts, outcome, cooldown state; record provider-cache hit/backing on the cache span
- `pkg/runtime/agent_delegation.go`: emit `runtime.task_transfer` and `runtime.handoff` spans with `gen_ai.operation.name=invoke_agent` and `gen_ai.agent.name`
- `pkg/runtime/skill_runner.go`: emit `invoke_workflow {skill}` per spec
- `pkg/runtime/toolexec/dispatcher.go`: open `runtime.tool.call` and `runtime.tool.handler` spans with the GenAI execute_tool semconv, capture `gen_ai.tool.call.{arguments,result}` under the content-capture opt-in, and stamp `cagent.approval.{decision,source}` from `notifyApproval` so denied / canceled / read-only-allowed calls are distinguishable in trace dashboards
- `pkg/runtime/compactor/compactor.go`: wrap compaction with a span that carries summary tokens and cost
- `pkg/tools/builtin/agent/agent.go`: open a `background_agent.run` root span with a link back to the spawning context, and stamp `gen_ai.conversation.id` from baggage so the span participates in conversation-scoped queries
- `pkg/tools/startable.go`, `pkg/toolinstall/registry.go`: wrap toolset Start with a `toolset.start` span so capability discovery latency is attributable
…race context
- `pkg/hooks/executor.go`: open a single `hook.{event}` INTERNAL span per Dispatch covering every matched hook, then `annotateHookSpan` stamps the aggregated `Result` so denied / asked / allowed / modified-input / summary-provided cases are distinguishable. Verdict booleans and the structured decision/reason are unconditional; free-text `message` / `additional_context` / `system_message` / `summary` are gated on `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT`
- `pkg/hooks/handler.go`: append `genai.InjectTraceContextEnv(ctx)` to the hook subprocess env so script-driven hooks that emit OTel spans (or call instrumented CLIs / LLM endpoints) chain onto the parent `hook.{event}` span instead of producing orphaned roots
- `pkg/mcp/server.go`: route the MCP HTTP transport through `otelhttp.NewHandler` and `otelmcp.StartServer` so inbound requests carry `traceparent` / `baggage` and emit a SERVER span per call
- `pkg/tools/mcp/session_client.go`: wrap MCP client calls (`tools/list`, `tools/call`, `prompts/list`) with CLIENT spans using the params._meta propagation carrier. Iterator wrappers open the span inside the iterator closure (not at call time) so unused iterators do not leak spans, and end on every exit path including early `yield` returns
- `pkg/tools/mcp/oauth.go`, `oauth_helpers.go`, `oauth_login.go`, `oauth_server.go`: wrap interactive OAuth flow and token refresh with `oauth.flow` / `oauth.token.refresh` CLIENT spans, route metadata HTTP calls through `httpclient.TracedClient` / `TracedDefaultClient`, and emit `oauth.step` span events at each network sub-step boundary (`fetch_protected_resource_metadata`, `fetch_authorization_server_metadata`, `dynamic_client_registration`, `request_authorization_code`, `token_exchange`) so a failure can be attributed to a specific stage without descending into HTTP children
…nt semconv
- `pkg/a2a/server.go`: wrap the agent-card and JSON-RPC endpoints with `otelhttp.NewHandler` so inbound A2A requests extract `traceparent` / `tracestate` / `baggage` and emit a SERVER span. The outer `agent-a2a` server wrap covers any auxiliary routes
- `pkg/a2a/adapter.go`: in `runDockerAgent`, decorate the active SERVER span with `gen_ai.operation.name=invoke_agent`, `gen_ai.agent.name`, and `cagent.agent.name`. Wires the runtime tracer scope so per-invocation `runtime.session` / `runtime.stream` / `runtime.tool.call` chain onto the inbound A2A span instead of starting fresh trace ids per request
…ints, and add cold-start spans
- `pkg/server/server.go`: wrap the agent-api Echo handler with `otelhttp.NewHandler` so inbound API requests extract `traceparent` / `tracestate` / `baggage` and the runtime spans started downstream chain onto the calling client trace
- `pkg/server/session_manager.go`: wire the runtime tracer scope into per-session runtime construction; open a `session.runtime_init` INTERNAL span on the cold path (team load + runtime construction) so per-request first-use latency is attributable. Cached hits skip the span — they are a pointer load
- `pkg/chatserver/server.go`, `pkg/chatserver/runtime_pool.go`: wrap the chat completions HTTP server with `otelhttp.NewHandler` and propagate the runtime tracer through the per-session pool
- `pkg/teamloader/teamloader.go`: open a `teamloader.load` INTERNAL span around `LoadWithConfig` so the cold-start path (config parse, model alias resolution, OCI agent pulls, toolset starts) becomes attributable
- `pkg/acp/agent.go`: wire the runtime tracer into the ACP entry point so its sub-spans share scope with CLI / API runs
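The "span only on the cold path" shape can be sketched with a counter standing in for starting `session.runtime_init` (types and names here are illustrative, not the PR's session manager):

```go
package main

import (
	"fmt"
	"sync"
)

// initSpans counts cold-path initializations, standing in for spans.
var initSpans int

type runtime struct{ id string }

type sessionManager struct {
	mu    sync.Mutex
	cache map[string]*runtime
}

// get returns the cached runtime on the hot path (a map load, no span);
// only the cold path wraps team load + construction in the stand-in
// session.runtime_init span.
func (m *sessionManager) get(id string) *runtime {
	m.mu.Lock()
	defer m.mu.Unlock()
	if rt, ok := m.cache[id]; ok {
		return rt // cached hit: no span overhead
	}
	initSpans++ // stand-in for starting session.runtime_init
	rt := &runtime{id: id}
	m.cache[id] = rt
	return rt
}

func main() {
	m := &sessionManager{cache: map[string]*runtime{}}
	m.get("s1")
	m.get("s1") // hit: no new span
	m.get("s2")
	fmt.Println(initSpans) // 2: one per cold start, none for the hit
}
```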
- `pkg/memory/database/sqlite/sqlite.go`: open `memory.{op}` spans on `AddMemory`, `SearchMemories`, etc., with named-return error capture so failures attach to the span via `RecordError`. The search path additionally emits a `retrieval` semconv span for cross-tool dashboards
- `pkg/rag/manager.go`: open `retrieval` (semconv) spans on `Query`, plus `rag.init` / `rag.reindex` / `rag.file_watcher` for lifecycle visibility
- `pkg/sessiontitle/generator.go`: wrap title generation with a `sessiontitle.generate` span; named-return errors fold onto the span on failure
- `pkg/evaluation/judge.go`: emit `gen_ai.evaluation.result` log events from the LLM-as-judge evaluator with score / explanation / error.type, linked to the active span via context for cross-signal join
- `pkg/tools/builtin/shell.go`, `script_shell.go`: stamp `cagent.tool.{shell,script_shell}.{cmd,cwd,timeout_seconds}` on the active `runtime.tool.handler` span. Cmd ships unconditionally because it is the main signal of what the agent did; redact at the OTel collector if commands carry secrets
- `pkg/tools/builtin/filesystem.go`: stamp `cagent.tool.filesystem.{op,path,paths,path_count}` covering all file operations. Paths ship unconditionally for the same incident-response reason
- `pkg/tools/builtin/fetch.go`: stamp `cagent.tool.fetch.{urls,url_count,format}`; each fetched URL still emits its own HTTP CLIENT child span via `httpclient.WrapWithOTel`
- `pkg/tools/builtin/lsp.go`: wrap every tool from `lspTool` so each LSP RPC stamps `cagent.tool.lsp.{tool,read_only}` on the parent span
- `pkg/tools/builtin/lsp_lifecycle.go`: inject `genai.InjectTraceContextEnv(ctx)` into the LSP server spawn env so OTel-aware language servers chain onto the agent trace
- `pkg/tools/builtin/openapi.go`, `pkg/tools/builtin/api.go`: route the user-facing HTTP clients through `httpclient.WrapWithOTel(remote.NewTransport(ctx))` so each API call emits a CLIENT span and propagates `traceparent`
- `pkg/tools/codemode/exec.go`: stamp `cagent.tool.codemode.{script,script_length,tool_call_count}` so a code-mode turn is visible as "ran N lines of JS that called M tools"
…tion
Wrap the HTTP transport chain with `httpclient.WrapWithOTel` so every outbound MCP request injects W3C `traceparent` headers and creates an HTTP CLIENT span. Without this wrap, the streamable-HTTP/SSE transports the gomcp SDK builds send raw POST/GET requests that never chain onto the calling cagent span; the downstream MCP server's spans then live in a separate root trace, breaking end-to-end observability for any agent talking to a remote MCP server. `WrapWithOTel` is a no-op when OTel is disabled at runtime, so the laptop-mode default stays unchanged.
Every toolset goes through tools.WithName in the team-loader
registry, which sandwiches a *tools.namedToolSet between the
StartableToolSet and the actual implementation. %T on the
embedded ToolSet therefore always reported *tools.namedToolSet
regardless of whether the inner toolset was MCP, A2A, a builtin,
or anything else - so the attribute could never answer the
question it exists to answer ("which kind of toolset is starting
right now?").
Unwrap once before formatting, mirroring what DescribeToolSet
already does for the same reason. Now the attribute reads
*mcp.Toolset, *builtin.ShellTool, etc., so a toolset.start
without HTTP children is immediately distinguishable from a
remote MCP whose POSTs are missing for some other reason.
Record tool counts at two key points in the execution flow:
- Session span: total tools available after exclusion filters
- MCP list span: tools successfully yielded by each server
These attributes enable quick analysis of tool availability without inspecting nested spans or JSON-RPC payloads. The MCP count preserves partial results when iteration terminates early.
…errors
Introduce a `classifyByStatusCode` helper that probes for an HTTP status code via a `StatusCode() int` method before falling back to substring matching. This prevents false positives when error messages incidentally contain strings like "401", "403", or "429" in request IDs, byte counts, or status-line fragments. Providers that expose HTTP status codes through a structured interface now get classified from the structural signal, while text-only errors continue to use the existing heuristic. Also add documentation clarifying that `getInstruments` binds to the global MeterProvider on first call via `sync.Once`, which affects test setup requirements.
tdabasinskas force-pushed from d48faf7 to 946df1b
Rebased.
Note: this doesn't happen all the time; I just started a new session and there were no warnings.
Hi @rumpl, The warnings seem to be from Jaeger's clock-skew adjuster, not from docker-agent. They fire when Jaeger queries the trace before all of the span batches have flushed (children get there before their parent). Does re-loading the trace later (after 30s or so) make the warnings disappear? Could you see if enabling more eager flushing via


Adds end-to-end OpenTelemetry instrumentation following the GenAI semantic conventions:
- Chat / embeddings / rerank CLIENT spans with `gen_ai.*` attributes and the `gen_ai.client.token.usage` / `operation.duration` histograms.
- Runtime spans (`runtime.session`, `runtime.stream`, `runtime.fallback`, `runtime.tool.call`, `runtime.run_skill`, `runtime.task_transfer`, `runtime.handoff`, `background_agent.run`).
- MCP client/server telemetry with `params._meta` propagation, plus OAuth flow spans.
- Inbound server endpoints wrapped with `otelhttp` and marked as `invoke_agent`.
- Trace context propagated into the sandbox via `docker exec`.
- Resource attributes (`service.*`, `host.*`, `process.*`, `os.type`)

This PR wires two opt-in env vars beyond the default OTel SDK ones:
- `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT` — capture prompts, responses, tool arguments and tool results as span attributes. Off by default (PII surface).
- `OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental` — emit only the spec-defined `gen_ai.*` keys. Default is dual-emit (both `gen_ai.*` and the legacy `tool.name` / `agent` / `session.id` keys), so existing dashboards keep working alongside spec-aware tooling.

The diff is large — ~50 files, ~5k lines. It's split into 10 topical commits (telemetry primitives → SDK init → providers → runtime → hooks → MCP → A2A → servers/cold-start → memory/RAG → tool internals) so each commit is independently reviewable. Most of the volume is in the new `pkg/telemetry/genai/` and `pkg/telemetry/mcp/` packages, which are pure helpers; the surface-area changes elsewhere are 1-3 lines per call site.